(57) Abstract: The present invention provides a new method and an apparatus for spectral envelope encoding. The invention teaches
how to perform and compactly signal a time/frequency mapping of the envelope representation, and further, how to encode the spectral
envelope data efficiently using adaptive time/frequency directional coding. The method is applicable to both natural audio coding
and speech coding systems and is especially suited for coders using SBR [WO 98/57436] or other high frequency reconstruction
methods.

SUMMARY OF THE INVENTION

The present invention provides a new method and an apparatus for spectral envelope coding. The coding
scheme is designed to meet the special requirements of systems where the residual signal within certain
frequency regions is excluded from the transmitted data. Examples are systems employing HFR (High
Frequency Reconstruction), in particular SBR (Spectral Band Replication), or parametric coders. In one
implementation, non-uniform time and frequency sampling of the spectral envelope is obtained by
adaptively grouping subband samples from a fixed size filterbank into frequency bands and time
segments, each of which generates one envelope sample. This allows instantaneous selection of arbitrary
time and frequency resolution within the limits of the filterbank. The system defaults to long time
segments and high frequency resolution. In the vicinity of transients, shorter time segments are used,
whereby larger frequency steps can be used in order to keep the data size within limits. In order to
maximize the benefits of the non-uniform sampling in time, bitstream frames or granules of variable
length are used. The variable time/frequency resolution method is also applicable to envelope
encoding based on prediction. Instead of grouping subband samples, predictor coefficients are
generated for time segments of varying lengths according to the system.

The invention describes two schemes for signalling of the time and frequency resolution used. The first
scheme allows arbitrary selection, by explicit signalling of time segment borders and frequency
resolutions. In order to reduce the signalling overhead, four classes of granules are used, offering
different cost/flexibility tradeoffs. The second scheme exploits the property of typical programme
material that transients are separated by at least a time $T_{nmin}$, in order to reduce the number of control
bits further. Here, a transient detector in the encoder, operating on a time interval $T_{det} \le T_{nmin}$, equal
to the nominal granule length, determines the position of the onset of a possible transient. The position
within the interval is encoded and sent to the decoder. The encoder and decoder share rules that specify
the time/frequency distribution of the spectral envelope samples, given a certain combination of
subsequent control signals, ensuring an unambiguous decoding of the envelope data.

The present invention presents a new and efficient method for scalefactor redundancy coding. A Dirac
pulse in the time domain transforms to a constant in the frequency domain, and a Dirac pulse in the frequency
domain, i.e. a single sinusoid, corresponds to a signal with constant magnitude in the time domain.
Simplified, on a short term basis, the signal shows less variation in one domain than in the other. Hence,
using prediction or delta coding, coding efficiency is increased if the spectral envelope is coded in either
the time- or the frequency-direction, depending on the signal characteristics.

BRIEF DESCRIPTION OF THE DRAWINGS

spirit of the invention, with reference to the accompanying drawings, in which:

Figs. 1a - 1b illustrate uniform and non-uniform sampling in time, respectively.

Figs. 2a - 2b define, and illustrate usage of, four classes of granules.

Figs. 3a - 3b are two examples of granules, and the corresponding control signals.

Figs. 4a - 4c illustrate the position signalling system.

Fig. 5 illustrates time/frequency switched delta coding.

Fig. 7 is a block diagram of a decoder using the envelope coding according to the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

of description and explanation of the embodiments herein. 

spectral envelopes. E.g. amplification factors in an envelope adjusting filterbank are calculated as the
square root of the quotient between the average power of the original signal and that of the transposed
signal. For this kind of signal, a problem arises: The transposed signal has the same "chord-to-transient"
power ratio as the lowband. The gains needed in order to adjust the transposed transients to the correct
level thus cause the transposed chords to be amplified relative to the original highband level for the full
duration of the envelope data containing transient energy. These momentarily too loud chord fragments are
perceived as pre- and post-echoes to the transient, see Fig. 1a. This kind of distortion will hereinafter be
referred to as "gain induced pre- and post-echoes". The phenomenon can be eliminated by constantly updating
the envelope data at such a high rate that the time between an update and an arbitrarily located transient is
guaranteed to be short enough not to be resolved by the human hearing. However, this approach would
drastically increase the amount of data to be transmitted and is thus not feasible.
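
As a numerical illustration of the gain rule and of the distortion described above, the following minimal sketch computes a single gain for one envelope update interval that contains both chord and transient energy. The power values and the names `mean` and `envelope_gain` are hypothetical and chosen only to make the effect visible.

```python
import math

def mean(values):
    return sum(values) / len(values)

def envelope_gain(original_power, transposed_power):
    """Gain = sqrt(average original power / average transposed power)."""
    return math.sqrt(original_power / transposed_power)

# Hypothetical per-slot powers within ONE envelope update interval:
# three chord slots followed by one transient slot.
orig_highband = [1.0, 1.0, 1.0, 100.0]   # original: weak chord, strong transient
transposed = [4.0, 4.0, 4.0, 16.0]       # transposed: lowband's chord-to-transient ratio

gain = envelope_gain(mean(orig_highband), mean(transposed))

# Power of each transposed slot after applying the single gain.
adjusted = [p * gain ** 2 for p in transposed]

# The average power now matches the original, but the chord slots
# end up louder than the original chord: a gain induced pre-echo.
chord_excess = adjusted[0] / orig_highband[0]
```

With these numbers the interval's average power is matched, yet the chord slots preceding the transient are amplified well above their original level, which is the pre-echo mechanism the text describes.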

Therefore a new envelope data generation scheme is presented. The solution is to maintain a low update
rate during tonal passages, which make up the major part of typical programme material, and by means
of a transient detector localize the transient positions and update the envelope data close to the leading
flanks, see Fig. 1b. This eliminates gain induced pre-echoes. In order to represent the decay of the
transients well, the update rate is momentarily increased in a time interval after the transient start. This
eliminates gain induced post-echoes. The time segmenting during the decay is not as crucial as finding
the start of the transient, as will be explained later. In order to compensate for the smaller time steps,
larger frequency steps can be used during the transient, keeping the data size within limits. A
non-uniform sampling in time and frequency as outlined above is applicable to both filterbank-based and
linear prediction-based envelope coding. Different predictor orders may be used for transient and
quasi-stationary (tonal) segments.

In the case of prediction based coders, no elaborate time/frequency resolution switching schemes are known
from prior art. However, some filterbank based coders employ variable time/frequency resolution. This
is commonly achieved through switching of the filterbank size. Such a change in size cannot take place
immediately; so-called transition windows are required, and thus the update points cannot be chosen
freely. When using SBR or any other HFR method, the objective is different: a filterbank can be
designed to meet both the highest temporal and the highest frequency resolution needed to extract an
adequate envelope representation. Thus, the non-uniform time and frequency sampling of the spectral
envelope can be obtained by adaptive grouping of the subband samples from a fixed size filterbank into
"frequency bands" and "time segments". One envelope sample is then calculated per band and segment.
Throughout the description below, "frequency resolution" refers to a specific set of frequency bands, LPC
coefficients or similar, used in the envelope estimate for a particular time segment. In other words, from
an envelope coding perspective, high frequency resolution or high time resolution can be obtained
instantaneously.
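
The grouping of subband samples into frequency bands and time segments can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `group_envelope`, the border lists and the use of average power as the envelope sample are assumptions made for the example.

```python
def group_envelope(power, time_borders, band_borders):
    """Average subband sample power over each (time segment, frequency band).

    power[n][k]  : power of the subband sample at time slot n, subband k
    time_borders : slot indices delimiting the time segments, e.g. [0, 6, 8]
    band_borders : subband indices delimiting the frequency bands, e.g. [0, 2, 4]
    Returns envelope[segment][band], one envelope sample per band and segment.
    """
    envelope = []
    for t0, t1 in zip(time_borders, time_borders[1:]):
        row = []
        for k0, k1 in zip(band_borders, band_borders[1:]):
            cells = [power[n][k] for n in range(t0, t1) for k in range(k0, k1)]
            row.append(sum(cells) / len(cells))
        envelope.append(row)
    return envelope

# Hypothetical fixed size filterbank output: 8 time slots x 4 subbands,
# with a transient starting at slot 6.
power = [[1.0] * 4 for _ in range(6)] + [[9.0] * 4 for _ in range(2)]

# Default grid: one long time segment, high frequency resolution (4 bands).
tonal = group_envelope(power, [0, 8], [0, 1, 2, 3, 4])

# Transient grid: a border at the onset, coarser frequency resolution (2 bands).
transient = group_envelope(power, [0, 6, 8], [0, 2, 4])
```

Both grids produce four envelope samples from the same filterbank output, which illustrates how time resolution can be traded for frequency resolution without changing the filterbank.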

As will be shown below, many of the states described by Eq. 1 are not very likely, and would also
generate too large amounts of envelope data to be practical at a limited bitrate.

The minimum time-span between consecutive transients in music programme material can be estimated
in the following way: In musical notation, the rhythmic "pulse" is described by a time signature
expressed as a fraction A/B, where A denotes the number of "beats" per bar and 1/B is the type of note
corresponding to one beat, for example a 1/4 note, commonly referred to as a quarter note. Let t denote
the tempo in Beats Per Minute (BPM). The time per note of type 1/C is then given by

$$T_n = \frac{60}{t} \cdot \frac{B}{C} \ \text{[s]} \qquad \text{(Eq 2)}$$

Most music pieces fall within the 70-160 BPM range, and in 4/4 time signature the fastest rhythmical
patterns are for most practical cases made up of 1/32 notes (32nd notes). This yields a minimum time
$T_{nmin} = (60/160) \cdot (4/32) \approx 47$ ms. Of course shorter time periods than this may occur, but such fast
sequences (> 21 events per second) almost take on the character of a buzz and need not be fully resolved.
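
The arithmetic behind Eq 2 and the 47 ms figure can be checked with a few lines of code. The function and parameter names below are illustrative only.

```python
def note_duration(tempo_bpm, beat_note, note):
    """Eq 2: duration in seconds of a 1/note note when a 1/beat_note note is one beat."""
    return (60.0 / tempo_bpm) * (beat_note / note)

# Fastest case considered in the text: 160 BPM, 4/4 time, 32nd notes.
t_nmin = note_duration(160, 4, 32)      # 0.046875 s, about 47 ms
events_per_second = 1.0 / t_nmin        # about 21.3 events per second

# For comparison, a quarter note at the same tempo lasts 0.375 s.
quarter = note_duration(160, 4, 4)
```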

The necessary time resolution $T_q$ must also be established. In some cases a transient signal has its main
energy in the highband to be reconstructed. This means that the encoded spectral envelope must carry all
the "timing" information. The desired timing precision thus determines the resolution needed for
encoding of leading flanks. $T_q$ is much smaller than the minimum note period $T_{nmin}$, since small time
deviations within the period can clearly be heard. In most cases, however, the transient has significant
energy in the lowband. The above described gain-induced pre-echoes must fall within the so-called pre-
or backward masking time $T_m$ of the human auditory system in order to be inaudible. Hence $T_q$ must
satisfy two conditions:

$$T_q \ll T_{nmin} \qquad \text{(Eq 3)}$$

$$T_q < T_m \qquad \text{(Eq 4)}$$

Obviously $T_m < T_{nmin}$ (otherwise the notes would be so fast that they could not be resolved), and
according to ["Modeling the Additivity of Nonsimultaneous Masking", Hearing Res., vol. 80, pp. 105-
118 (1994)], $T_m$ amounts to 10-20 ms. Since $T_{nmin}$ is in the 50 ms range, a reasonable selection of $T_q$
according to Eq 3 means that the second condition is also met. Of course the precision of the transient
detection in the encoder and the time resolution of the analysis/synthesis filterbank must also be
considered when selecting $T_q$.
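
The two conditions of Eq 3 and Eq 4 can be expressed as a simple check. The text does not quantify "much smaller", so the factor `margin` below is a purely hypothetical choice, as are the candidate values of the time resolution.

```python
def timing_resolution_ok(t_q, t_nmin, t_m, margin=4.0):
    """Check Eq 3 (t_q much smaller than t_nmin) and Eq 4 (t_q < t_m).

    'Much smaller' is modelled by the hypothetical factor `margin`:
    t_q must be at least `margin` times shorter than t_nmin.
    """
    return t_q * margin <= t_nmin and t_q < t_m

t_nmin = 0.047   # s, minimum note period from Eq 2
t_m = 0.010      # s, lower end of the 10-20 ms backward masking range

fine = timing_resolution_ok(0.005, t_nmin, t_m)     # 5 ms: both conditions hold
coarse = timing_resolution_ok(0.015, t_nmin, t_m)   # 15 ms: fails both conditions
```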

classes, the next signal describes the location of the variable boundary, expressed as the offset from the
nominal position. This boundary is referred to as the "absolute border". The segment borders within the
granules are described by means of "relative borders": The absolute border is used as a reference, and the
other borders are described as cumulative distances to the reference. The number of relative borders is
variable, and is signalled to the decoder after the absolute border. A zero number means that the granule
comprises one time segment only. Thus, in the case of class FixVar, the segment lengths are signalled in a
reversed sequence, moving away from the absolute border at the end of the granule. The length of the first
segment in a FixVar granule is derived from the relative borders and the total length, and is not signalled.
Class VarFix relative border signals are inserted into the bitstream in a forward sequence, whereby the last
segment length is excluded. The bitstream signal order is identical to that of class FixVar, that is: [class,
abs. border, number of rel. borders, rel. border 0, rel. border 1, ..., rel. border N - 1]. In the figure, the
signals are shown in "clear text" instead of the actual binary code words sent in the bitstream.
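
The reconstruction of segment borders from the FixVar and VarFix signals can be sketched as below. This is one reading of the description, under the assumption that each relative border is a segment length accumulated while moving away from the absolute border; the function name, the slot units and the example values are hypothetical.

```python
def decode_borders(granule_class, fixed_border, abs_border, rel_borders):
    """Return the sorted segment borders of a FixVar or VarFix granule.

    fixed_border : position of the boundary at its nominal (fixed) location
    abs_border   : position of the variable boundary (nominal position + offset)
    rel_borders  : segment lengths, accumulated moving away from abs_border
    """
    if granule_class == "FixVar":      # variable boundary at the granule end
        step = -1
    elif granule_class == "VarFix":    # variable boundary at the granule start
        step = +1
    else:
        raise ValueError("this sketch covers FixVar and VarFix only")
    borders = [abs_border]
    for length in rel_borders:
        borders.append(borders[-1] + step * length)
    borders.append(fixed_border)       # the remaining segment length is implied
    return sorted(borders)

# FixVar: fixed start at slot 0, nominal end 16, variable end offset by +2.
fixvar = decode_borders("FixVar", 0, 18, [2, 4])     # [0, 12, 16, 18]

# VarFix: variable start at slot 18, fixed end at slot 32.
varfix = decode_borders("VarFix", 32, 18, [2, 2])    # [18, 20, 22, 32]

# Zero relative borders: the granule comprises one time segment only.
single = decode_borders("FixVar", 0, 16, [])         # [0, 16]
```

In both classes the segment adjacent to the fixed boundary is never signalled, since its length follows from the other values.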

Fig. 3b shows an alternative coding of the signal. The variable boundary offers versatility when grouping
the segments at a given global grid. Thus some payload control can be performed at this level, e.g. to
equalize the number of bits per granule. This may ease the operation of the lowband encoder. Given
enough look-ahead, a multipass encoding can be performed, and the optimum combination of local grids
be used.

In order to reduce the symbol set for signalling of relative borders, and thereby the number of bits per
symbol, those lengths can be quantized to an integer multiple (> 1) of $T_q$, if the absolute border has the
precision $T_q$. In this case the absolute border, in addition to the above function, serves to align a group of
borders around the transient with the precision $T_q$. In other words, the highest precision is always
available for coding of transient leading flanks, and a coarser resolution is used in the tracking of the
decay.

The VarVar class frames use a combination of the FixVar and VarFix signalling, e.g. interleaved: [class,
abs. border left, abs. border right, num. rel. borders left, num. rel. borders right, [rel. border left 0, ...,
rel. border left N - 1], [rel. borders right]]. This class offers the greatest flexibility in the local grid
selection, at the cost of an increased signalling overhead. Finally, the FixFix class does not require signals
other than the class signal per se, in which case for example two (equal length) segments are used. However,
it is feasible to add a signal that enables selection within a set of predefined grids. For example, the
spectral envelope can be calculated for two segments, and if the two envelopes do not differ by more than
a certain amount, only one set of envelope data is sent.

This system may be viewed as a finite state machine, where the above described signals control the
transitions from state to state, and the states define the local grids. Clearly, the states can be represented
by tables, stored in both the encoder and the decoder. Since the grids are hard coded, the ability to
adaptively alter the payload has been sacrificed. A reasonable approach is to keep the time/frequency
data matrix size (e.g. number of power estimates) approximately constant. Assuming that the number of
scalefactors or coefficients in a high resolution segment is two times that of a low resolution segment, one
high resolution segment can be traded for two low resolution segments.



Time/Frequency Switched Scalefactor Encoding 
Utilising a time to frequency transform, it can be shown that a pulse in the time domain corresponds to a
flat spectrum in the frequency domain, and a "pulse" in the frequency domain, i.e. a single sinusoid,
corresponds to a quasi-stationary signal in the time domain. In other words, a signal usually shows more
transient properties in one domain than in the other. In a spectrogram, i.e. a time/frequency matrix display,
this property is evident, and can advantageously be used when coding spectral envelopes.

A tonal stationary signal can have a very sparse spectrum not suitable for delta coding in the frequency-
direction, but well suited for delta coding in the time-direction, and vice versa. This is displayed in Fig.
5. Throughout the following description, a vector of scalefactors calculated at time $n_0$ represents the
spectral envelope

$$Y(k, n_0) = [a_1, a_2, a_3, \ldots, a_k, \ldots, a_N], \qquad \text{(Eq 5)}$$

where $a_1 \ldots a_N$ are the amplitude values for different frequencies. Common practice is to code the
difference between adjacent values in the frequency-direction at a given time, which yields:

$$D(k, n_0) = [a_2 - a_1, a_3 - a_2, \ldots, a_N - a_{N-1}]. \qquad \text{(Eq 6)}$$

In order to be able to decode this, the start value $a_1$ needs to be transmitted. As stated above, this
delta-coding scheme can prove to be most inefficient if the spectrum only contains a few stationary tones. This
can result in a delta coding yielding a higher bit rate than regular PCM coding. In order to deal with this
problem, a time/frequency switching method, hereinafter referred to as T/F-coding, is proposed: The
scalefactors are quantized and coded both in the time- and frequency-direction. For both cases, the
required number of bits is calculated for a given coding error, or the error is calculated for a given number
of bits. Based upon this, the most beneficial coding direction is selected.
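
The selection step of T/F-coding can be illustrated with a minimal sketch. It assumes that the scalefactors are already quantized to integers (so the coding error is fixed and only the bit count is compared), and it uses a hypothetical Exp-Golomb-like length function in place of the actual code tables, which are not specified here. All function names are illustrative.

```python
def delta_freq(env):
    """Start value a_1 followed by differences between adjacent frequencies (Eq 6)."""
    return [env[0]] + [b - a for a, b in zip(env, env[1:])]

def delta_time(env, prev_env):
    """Differences against the previous envelope at the same frequencies."""
    return [b - a for a, b in zip(prev_env, env)]

def bit_cost(symbols):
    """Hypothetical cost model: small magnitudes are cheap, large ones expensive."""
    return sum(2 * ((2 * abs(s) + 1).bit_length() - 1) + 1 for s in symbols)

def choose_direction(env, prev_env):
    """Code in both directions and keep the one needing the fewest bits."""
    costs = {"freq": bit_cost(delta_freq(env)),
             "time": bit_cost(delta_time(env, prev_env))}
    return min(costs, key=costs.get), costs

# Sparse tonal spectrum, stationary in time: large jumps along frequency only.
tonal_dir, tonal_costs = choose_direction([0, 20, 0, 0, 18, 0],
                                          [0, 20, 0, 0, 18, 0])

# Flat spectrum that jumps in level at a transient: large jumps along time only.
trans_dir, trans_costs = choose_direction([30, 30, 30, 30, 30, 30],
                                          [2, 2, 2, 2, 2, 2])
```

Under this cost model the tonal case selects the time-direction and the transient case selects the frequency-direction, matching the behaviour described for Fig. 5. A control signal indicating the chosen direction would accompany the coded data.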

The decoder side of the invention is shown in Fig. 7, using SBR transposition as an example of generation
of the missing residual signal. The demultiplexer 701 restores the signals and feeds the appropriate part
to an audio decoder 702, which produces a low band digital audio signal. The envelope information is
fed from the demultiplexer to the envelope decoding block 703, which, by use of control data, determines
in which direction the current envelope is coded, and decodes the data. The low band signal from the
audio decoder is routed to the transposition module 704, which generates a replicated high band signal 
from the low band. The high band signal is fed to an analysis filterbank 706, which is of the same type as 
on the encoder side. The subband signals are combined in the scalefactor grouping unit 707. By use of 
control data from the demultiplexer, the same type of combination and time/frequency distribution of the 
subband samples is adopted as on the encoder side. The envelope information from the demultiplexer 
and the information from the scalefactor grouping unit is processed in the gain control module 708. The 
module computes gain factors to be applied to the subband samples before recombination in the synthesis 
filterbank block 709. The output from the synthesis filterbank is thus an envelope adjusted high band 
audio signal. This signal is added to the output from the delay unit 705, which is fed with the low band 
audio signal. The delay compensates for the processing time of the high band signal. Finally, the 
obtained digital wideband signal is converted to an analogue audio signal in the digital to analogue 
converter 710. 
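
The interplay of the scalefactor grouping unit 707 and the gain control module 708 can be sketched as follows. This is an assumption-laden illustration rather than the described apparatus: it uses real-valued subband samples, average power as the envelope measure, and the square-root gain rule stated earlier in the description. The name `adjust_envelope` and all values are hypothetical.

```python
import math

def adjust_envelope(subband, target_env, time_borders, band_borders):
    """Scale transposed subband samples so that each (segment, band) cell
    reaches the transmitted average power target_env[segment][band]."""
    out = [row[:] for row in subband]
    segments = list(zip(time_borders, time_borders[1:]))
    bands = list(zip(band_borders, band_borders[1:]))
    for s, (t0, t1) in enumerate(segments):
        for b, (k0, k1) in enumerate(bands):
            cells = [(n, k) for n in range(t0, t1) for k in range(k0, k1)]
            # Envelope of the transposed signal on the same grid as the encoder.
            measured = sum(subband[n][k] ** 2 for n, k in cells) / len(cells)
            gain = math.sqrt(target_env[s][b] / measured) if measured > 0 else 0.0
            for n, k in cells:
                out[n][k] = subband[n][k] * gain
    return out

# Transposed high band: 4 time slots x 2 subbands of real-valued samples.
transposed = [[2.0, 1.0], [2.0, 1.0], [2.0, 1.0], [2.0, 1.0]]
# Decoded envelope: two time segments, one band spanning both subbands.
target = [[1.0], [10.0]]
adjusted = adjust_envelope(transposed, target, [0, 2, 4], [0, 2])

def cell_power(samples, t0, t1):
    values = [x ** 2 for row in samples[t0:t1] for x in row]
    return sum(values) / len(values)
```

After adjustment, the average power of each cell equals the transmitted envelope value, while the relative levels of the samples inside a cell are preserved.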

1. A method for spectral envelope coding in a source coding system, where said system comprises [...]
all operations performed after storage or transmission, and where a residual signal corresponding [...]
said decoder, characterised by:

at said encoder, perform a statistical analysis of the input signal,

based on the outcome of said analysis, select the grid to be used in the spectral envelope
representation,

using said grid, generate data representing said spectral envelope,

transmit said data together with a control signal describing said grid, and

at said decoder, using said control signal and said data in the synthesis of the output signal.

2. A method according to claim 1, characterised in that [...]
is obtained by grouping of elements [...]
calculating a scalefactor for every one of said groups.

3. A method according to claim 2, characterised in that said time/frequency representation is generated
by a filterbank.

4. A method according to claim 3, characterised in that said filterbank is of fixed size.

5. A method according to claim 1, characterised in that said data is generated by a linear predictor.

6. A method according to claim 1, characterised in that said analysis employs a transient detector.

7. A method according to claim 6, characterised in that said [...] resolution [...]
default combination of higher frequency resolution and lower time resolution to a combination of lower
frequency resolution and higher time resolution at the onset of a transient.

[...] encoder and said decoder.

9. A method according to claim 8, characterised in that at most one position per granule is signalled.



WO 01/26095    PCT/SE00/01887



10. A method according to claim 1, characterised in that granules of variable length are used.

11. A method according to claim 10, characterised in that four classes of granules are used, whereby
the first class has fixed position granule boundaries, and the length L,
the second class has a fixed position start boundary, and a variable position stop boundary,
the third class has a variable position start boundary, and a fixed position stop boundary,
the fourth class has variable position start and stop boundaries, and
said fixed positions coincide with reference positions, separated by the distance L, and said variable
positions can be offset [-a,b] versus said reference positions.

12. A method according to claim 2, characterised in that said scalefactors are coded both in the time and
frequency direction, the momentarily most beneficial direction is determined, said most beneficial
direction is used for said transmission.

13. A method according to claim 12, characterised in that the direction which generates the least coding
error for a given number of bits is chosen.

14. A method according to claim 12, characterised in that the direction which generates the least number
of bits for a given coding error is chosen.

15. A method according to claim 14, characterised in that lossless coding is employed and separate
tables are used for said time and frequency directions, in particular where said tables are used for
selection of coding direction.

16. An apparatus for encoding of a spectral envelope of a signal to be decoded by a decoder,
characterised by:

means for performing a statistical analysis of the input signal,

means for selection of the instantaneous time and frequency resolution to be used in a spectral
envelope representation of said input signal, based on the outcome of said analysis,
means for generation of data representing said spectral envelope, using said resolution, and

means for transmission of said data together with a control signal describing said resolution.

17. An apparatus for decoding of a spectral envelope of a signal encoded by an encoder, characterised
by:

means for interpretation of a received control signal in order to determine the instantaneous time
and frequency resolution used in a spectral envelope representation of an encoded signal,
means for decoding of received envelope data based on said spectral envelope representation, using
said control signal, and

means for using said decoded envelope data in the synthesis of the output signal.

Fig. 1a: Uniform sampling in time (label: scalefactor generation time)

Fig. 1b: Non-uniform sampling in time (label: scalefactor generation time)

class = 0 (FixFix) <=> both boundaries fixed

class = 1 (FixVar) <=> leading boundary fixed, trailing boundary variable

class = 2 (VarFix) <=> leading boundary variable, trailing boundary fixed

class = 3 (VarVar) <=> both boundaries variable